MLP-Mixer as a Wide and Sparse MLP
The multi-layer perceptron (MLP) is a fundamental component of deep learning that
has been extensively employed for various problems. However, recent empirical
successes of MLP-based architectures, particularly the MLP-Mixer, have revealed that MLPs still hold untapped potential for better performance. In this study, we show that the MLP-Mixer effectively works as a wide MLP with certain sparse weights. First, we
clarify that the mixing layer of the Mixer has an effective expression as a
wider MLP whose weights are sparse and represented by the Kronecker product.
This expression naturally defines a permuted-Kronecker (PK) family, which can be regarded both as a general class of mixing layers and as an approximation of Monarch matrices. Subsequently, because the PK family
effectively constitutes a wide MLP with sparse weights, one can apply the
hypothesis proposed by Golubeva, Neyshabur, and Gur-Ari (2021) that prediction performance improves as the width (and hence the sparsity) increases when the number of weights is fixed. We empirically verify this hypothesis by maximizing
the effective width of the MLP-Mixer, which enables us to determine the
appropriate size of the mixing layers quantitatively.
Comment: 19 pages, 13 figures
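As a concrete illustration of the Kronecker-product expression described above, the following minimal sketch (in our own notation, not the paper's; the names S, C, X, and W are illustrative) checks numerically that a token-mixing step equals a single wide, block-sparse MLP layer acting on the flattened input:

    import numpy as np

    # Illustrative sketch: for X in R^{S x C} (S tokens, C channels), the
    # token-mixing step Y = W X equals one wide MLP layer whose weight
    # matrix is the sparse Kronecker product (I_C kron W) acting on vec(X).
    S, C = 4, 3
    rng = np.random.default_rng(0)
    X = rng.standard_normal((S, C))
    W = rng.standard_normal((S, S))   # token-mixing weights

    Y_mixer = W @ X                   # mixing applied per channel

    # vec(W X) = (I_C kron W) vec(X), with column-major (Fortran-order) vec
    K = np.kron(np.eye(C), W)         # (S*C) x (S*C), block-diagonal, sparse
    Y_wide = (K @ X.flatten(order="F")).reshape(S, C, order="F")
    assert np.allclose(Y_mixer, Y_wide)

    # Channel mixing Y = X U likewise gives vec(X U) = (U^T kron I_S) vec(X).

The wide matrix K here has (S*C)^2 entries but only C*S^2 nonzeros, which is the sense in which the effective MLP is wide and sparse at a fixed weight count.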
Attention in a family of Boltzmann machines emerging from modern Hopfield networks
Hopfield networks and Boltzmann machines (BMs) are fundamental energy-based
neural network models. Recent studies on modern Hopfield networks have broadened
the class of energy functions and led to a unified perspective on general
Hopfield networks including an attention module. In this letter, we consider
the BM counterparts of modern Hopfield networks using the associated energy
functions, and study their salient properties from a trainability perspective.
In particular, the energy function corresponding to the attention module
naturally introduces a novel BM, which we refer to as attentional BM (AttnBM).
We verify that AttnBM has a tractable likelihood function and gradient for a
special case and is easy to train. Moreover, we reveal the hidden connections
between AttnBM and some single-layer models, namely the Gaussian-Bernoulli restricted BM and the denoising autoencoder with softmax units. We also investigate
BMs introduced by other energy functions, and in particular, observe that the
energy function of dense associative memory models gives rise to BMs belonging to Exponential Family Harmoniums.
Comment: 12 pages, 1 figure
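For orientation, here is a hedged sketch of the attention-type energy from the modern Hopfield network literature (Ramsauer et al., 2021) that this letter builds on; the letter's AttnBM energy, defined over visible and hidden units, may differ in detail, and the names below (patterns, beta) are our own choices:

    import numpy as np

    def attention_energy(xi, patterns, beta=2.0):
        # E(xi) = -(1/beta) log sum_i exp(beta * x_i . xi) + 0.5 ||xi||^2
        scores = beta * patterns @ xi
        m = scores.max()                          # stabilize log-sum-exp
        lse = (m + np.log(np.exp(scores - m).sum())) / beta
        return -lse + 0.5 * xi @ xi

    def hopfield_update(xi, patterns, beta=2.0):
        # Energy-decreasing fixed-point step = softmax attention readout
        scores = beta * patterns @ xi
        p = np.exp(scores - scores.max())
        p /= p.sum()
        return patterns.T @ p                     # convex combination of patterns

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 8))               # 5 stored patterns in R^8
    xi = rng.standard_normal(8)
    for _ in range(3):
        xi = hopfield_update(xi, X)
    print(attention_energy(xi, X))

The update rule coincides with the softmax attention module mentioned in the abstract, which is the energy the letter carries over to its BM counterpart.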